09. Getting Stopwords from NLTK
Solution:
INSTRUCTOR NOTE:
Depending on your setup, downloading the corpus with the GUI (like I do) can be slow and painful. Here's a Stack Overflow page about downloading it from the command line instead: http://stackoverflow.com/questions/5843817/programmatically-install-nltk-corpora-models-i-e-without-the-gui-downloader
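For reference, here is a minimal sketch of the non-GUI route, assuming NLTK is installed and the machine has network access. The function name load_english_stopwords is made up for this example; nltk.download and nltk.corpus.stopwords are the library's own.

```python
def load_english_stopwords():
    """Fetch the stopwords corpus without the GUI and return the English list."""
    # Imported inside the function so that defining it does not require NLTK.
    import nltk
    from nltk.corpus import stopwords

    # Passing a package id skips the GUI downloader; quiet=True hides progress output.
    nltk.download("stopwords", quiet=True)
    return stopwords.words("english")
```

Calling len(load_english_stopwords()) then gives the number of English stopwords. The exact count depends on which version of the corpus data you have installed.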
Note: Version 3.1 of NLTK has a bug with obtaining and downloading the 'panlex_lite' corpus. While this is scheduled to be fixed in version 3.2, you can follow these steps to install this corpus in the meantime:
- Use nltk.download('all', halt_on_error=False) to get all of the corpora except for the 'panlex_lite' corpus.
- You should have a folder on your computer called "nltk_data" which holds all of the downloaded files referenced by nltk. (You might find it in your "/Users/username/" folder.) Save the archived version of the corpus from this link into the "nltk_data/corpora" folder. Warning: The zip file is 1.7 GB!
- Unzip the folder. You should have a file structure that looks like "nltk_data/corpora/panlex_lite/" which contains two files with the unarchived corpus data.
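The workaround steps can be sketched in Python roughly as follows. This is only an illustration: the helper names are hypothetical, the archive is assumed to have been saved by hand already, and nothing here downloads the 'panlex_lite' zip itself.

```python
import zipfile
from pathlib import Path


def download_all_corpora():
    """Download every NLTK package, continuing past any package that fails."""
    import nltk  # imported here so the helpers below work without NLTK installed
    return nltk.download("all", halt_on_error=False)


def existing_nltk_data_dirs():
    """List the nltk_data folders that NLTK searches and that exist on disk."""
    import nltk
    return [p for p in nltk.data.path if Path(p).is_dir()]


def unzip_corpus(archive_path, corpora_dir):
    """Extract a manually saved corpus archive into the corpora folder."""
    corpora_dir = Path(corpora_dir)
    corpora_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(corpora_dir)
    # Return the folder's contents so the resulting layout can be checked.
    return sorted(p.name for p in corpora_dir.iterdir())
```

For example, if the archive was saved as "panlex_lite.zip" (an assumed file name) inside "nltk_data/corpora", calling unzip_corpus with that file and that folder should leave a "panlex_lite" directory next to it. existing_nltk_data_dirs() can help if you are not sure where "nltk_data" lives on your machine.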
An update to the stopwords corpus in March 2016 changed the number of English stopwords: with the most recent corpus data, your answer should be 153.